Abstract

Background: Due to rapid sequencing of genomes, there are now millions of deposited protein sequences with 
no known function. Fast sequence-based comparisons allow detecting close homologs for a protein of interest to 
transfer functional information from the homologs to the given protein. Sequence-based comparison cannot 
detect remote homologs, in which evolution has adjusted the sequence while largely preserving structure. 
Structure-based comparisons can detect remote homologs but most methods for doing so are too expensive to 
apply at a large scale over structural databases of proteins. Recently, fragment-based structural representations 
have been proposed that allow fast detection of remote homologs with reasonable accuracy. These 
representations have also been used to obtain linearly-reducible maps of protein structure space. It has been 
shown, and is additionally supported by analysis in this paper, that such maps preserve functional co-localization in
the protein structure space.

Methods: Inspired by a recent application of the Latent Dirichlet Allocation (LDA) model for conducting structural 
comparisons of proteins, we propose higher-order LDA-obtained topic-based representations of protein structures 
to provide an alternative route for remote homology detection and organization of the protein structure space in 
few dimensions. Various techniques based on natural language processing are proposed and employed to aid the 
analysis of topics in the protein structure domain. 

Results: We show that a topic-based representation is just as effective as a fragment-based one at automated 
detection of remote homologs and organization of protein structure space. We conduct a detailed analysis of the 
information content in the topic-based representation, showing that topics have semantic meaning. The fragment- 
based and topic-based representations are also shown to allow prediction of superfamily membership. 

Conclusions: This work opens exciting avenues in designing novel representations to extract information about
protein structures, as well as organizing and mining protein structure space with mature text mining tools. 



Background 

Genome sequencing efforts utilizing high-throughput 
technologies are elucidating millions of protein-encoding 
sequences that currently lack any functional characteri- 
zation [1,2]. The function of a protein of interest can be 
inferred from other proteins with a common ancestor,
or homologs, with available functional characterization. 
Either sequence or structure information can be used for 
this purpose. The majority of methods used for genome- 
wide functional annotation are based on sequence 
comparisons and use sequence alignment to identify 
homologous proteins. Well-known sequence alignment 
tools include BLAST [3], PROSITE [4,5], and PFAM [6,7]. 
While typically fast, these tools are restricted to identifying 
mainly close homologs; that is, pairs of proteins with
significant sequence similarity. Function can then be trans- 
ferred onto an uncharacterized query protein when the 
sequence alignment tool identifies a homolog with known 
function and no less than 30% sequence identity with the 
query. 

It is often the case that functional similarity between two
proteins cannot be inferred from sequence information
alone. Sequence-based function inference may miss
detecting similar proteins where either early branching 
points (in such case the proteins are referred to as remote 
homologs) or convergent evolution has resulted in high 
sequence divergence while largely preserving structure and 
function. Many sequence-based methods have been 
offered to extend the applicability of sequence alignment 
tools for the detection of remote homologs [8-10]. The 
most successful ones, relying on statistical models learned 
over multiple aligned sequences, have been shown to 
improve upon methods based on pairwise sequence com- 
parison but still fail to recognize remote homologs with 
sequence identity less than 25% [11]. It is worth noting 
that about 25% of all sequenced proteins are estimated to 
fall in this category. 

The presence of remote homologs was identified as early 
as 1960, when Perutz and colleagues showed through 
structural alignment that myoglobin and hemoglobin have 
similar structures but different sequences [12]. Because 
structure is under more evolutionary pressure to be 
preserved than sequence, methods that compare struc- 
tures allow effectively casting a wider net at detecting 
related proteins for functional annotation. Structure-based 
function inference promises to detect remote homologs 
and expand options for assigning function to novel protein 
sequences. Many structure similarity methods have been 
proposed over the years, and two comprehensive 
comparisons pitting these methods against one another
in the context of a gold standard are presented in [13,14]. 
Well-known methods measuring the similarity of two 
protein structures include those based on Dynamic 
Programming (DP) [15-17], including SSAP [18] and 
STRUCTAL/LSQMAN [19-21], methods based on 
distance matrices, such as DALI [22], those based on 
extension of an alignment pinned at aligned fragment 
pairs or groups of residues, such as CE [23], LGA [24], 
TMAlign [25], methods based on comparison of 
secondary structure units, such as VAST [26,27] and 
SSM [28], and those based on comparison of backbone 
fragments [29]. 

Work on effective structure comparison methods has
been spurred by the Structural Genomics Initiative
[30] aiming to determine representative structures of all 
protein families. Such research remains challenging, 
mainly because the problem of finding the optimal 
structure similarity score is ill-posed and has no unique 



answer [31]. While ultimately the purpose is to transfer 
functional similarity to structurally-similar proteins, it 
remains open how biologically significant a particular 
structural alignment is [32,33]. 

The majority of structure-comparison methods obtain a 
structure similarity score after aligning the two protein 
structures provided for comparison. While this is desir- 
able, particularly in cases when the structures need to be 
analyzed in detail for the locations of high similarity 
regions, most structure alignment methods tend to be 
computationally expensive. As such, they are not suitable 
to be applied at a large scale over structural databases of 
proteins for the purpose of detecting structural neighbors 
of a protein of interest. To address this issue, filter 
approaches have been proposed, where the objective is to 
rapidly rule out some structures and employ more expen- 
sive structure alignment tools on the remaining set of 
structures. 

Most filter approaches for structure comparison rely on 
finding suitable representations of protein structure so 
that fast distance measurements can be employed over 
the representations to rapidly score the similarity of two 
protein structures without the computationally-intensive 
step of aligning two structures under comparison 
[34,29,35-41]. The representations are typically string or 
vector-based, and characters or elements are drawn over 
a pre-compiled alphabet or library of structural features. 
Representative filter methods include SGM [42], PRIDE 
[43], and that in [29]. 

In particular, fragment-based representations of protein
structures have been recently proposed to allow fast detection
of remote homologs with reasonable accuracy [29].
The representations are based on the bag-of-words 
(BOW) model of text documents, representing a protein 
structure as a bag of backbone fragments. Essentially, a 
representative set of backbone fragments of a given length 
are compiled over known protein structures [44]. A 
protein structure of interest is then represented as a vector 
whose entries record the number of times each of the 
fragments in the compiled library of fragments approxi- 
mates a segment in the given protein backbone. The 
resulting fragbag representation has been shown to be efficient
and effective at identifying structural neighbors of a given 
protein, including close and remote homologs [29]. It is 
worth noting that fragment-based representations have 
also been used for structural alignments [45,46]. 

Due to their efficiency, filter methods are appealing 
beyond large-scale detection of structural neighbors of a 
protein query. They can, through the additional applica- 
tion of dimensionality reduction techniques, organize 
known protein structure space and reveal interesting 
insight on the relationship between sequence, structure, 
and function in proteins [34,47,48]. Current applications 
operate on protein structure space as organized in protein 
structure databases, such as the "Structural Classification
of Proteins" (SCOP) [49] and the "Class, Architecture, 
Topology, and Homology" (CATH) databases [50,51]. It is 
worth noting that both databases contain protein domains 
rather than complete protein structures; that is, these data- 
bases break up and organize the known protein structures 
as deposited in the Protein Data Bank [52] in various 
ways. Biologists usually break up large proteins that con- 
tain multiple unrelated domains spliced together into one 
polypeptide based on a process that involves analysis of 
sequence, structure, and domain-specific expertise into 
what constitutes a domain. Both SCOP and CATH are 
hierarchical, as opposed to the "Families of Structurally 
Similar Proteins" (FSSP) database [53]. In SCOP and 
CATH, domains are first grouped/classified together 
based on common secondary structure components (this 
is known as Class), then common arrangement (Architec- 
ture in CATH), topology of secondary structure elements 
(fold in SCOP and Topology in CATH), and then homolo- 
gous superfamilies (Superfamily in SCOP and Homolo- 
gous family in CATH) and sequence families (family in 
both SCOP and CATH). Unlike SCOP, where the classification
is largely manual, CATH is more automated and
explicitly uses sequence and structure-based criteria for 
assigning homology. 

The fragbag representation has been recently employed 
to embed the protein structure space through simple 
linear dimensionality reduction techniques. The obtained 
low-dimensional maps are shown to provide interesting 
insight on the relationship between structure and function 
in the currently known protein universe [47] organized in 
SCOP [49] and CATH [51]. Other representations and 
ensuing maps have been obtained by other researchers 
over the years, showing, for instance, a closer relationship 
between structure and function than sequence and func- 
tion [34]. We confirm some of these findings in this paper, 
showing that an embedding of the fragbag-based space 
through Principal Component Analysis (PCA) is low- 
dimensional and groups structurally-similar domains 
together. 

In this paper, we present work on a novel low-dimensional
categorization of the protein structure
space. We seek representations that separate classes and 
capture the unique structural information in a class 
without relying on posterior dimensionality reduction 
techniques. We investigate a topic-based representation 
obtained through application of the Latent Dirichlet 
Allocation (LDA) model. A topic-based representation 
of protein structure has been proposed recently in [54] 
as an alternative to fragbag, but the study has been 
limited to employment of topics to identify structural 
neighbors of a given protein. We conduct a detailed 
analysis of the quality and information captured by 
topics, building on our previous work on topic-based 



representations of text documents in text mining [55]. 
We additionally demonstrate that a topic-based repre- 
sentation is just as descriptive and accurate as the frag- 
ment-based one not only at identifying remote 
homologs but also at organizing protein structure space. 
In particular, we demonstrate through the use of the 
ChiSquare significance test that many SCOP superfami- 
lies are statistically significant in the definition of the 
topics, essentially giving semantic meaning to topics in 
the same way that a group of text documents gives 
meaning to and defines a certain topic. Moreover, we 
show that the fragbag and topic-based representations 
allow binary classifiers to accurately predict SCOP super- 
family membership of protein structures. We believe the 
work presented in this paper opens exciting avenues in
designing novel representations to extract information 
about protein structures, as well as organizing and 
mining protein structure space with mature text mining 
tools. 

Methods 

We first summarize the fragbag representation of a protein 
structure, followed by a brief description of PCA. The 
LDA model is summarized next, with further description 
of the topic-based representations it offers on proteins and 
the measurements used to conduct the analysis over 
topics. 

Fragbag BOW representation of protein structure 

The fragbag representation builds on the Kolodny
fragment libraries [44] and on the concept of a
Cα-based molecular fragment. A library of fragments of
$l$ amino acids in [44] is constructed as follows.
Fragments of Cα traces of 200 accurately-determined
protein structures are clustered, depositing the representative
of each cluster in the fragment library. While analysis
on the fragbag representation considers fragment
libraries with fragments of length $l \in \{6, 12\}$, we
focus on fragments of length 11 in this paper, shown to
result in the highest accuracy in identifying structural 
neighbors in [29,54] and our own analysis (data not 
shown). 

The concept of molecular fragments allows obtaining 
a vector-based representation of a protein structure as 
follows. Given a fragment library of $F$ fragments of a fixed
length $l$, a protein structure $P$ can be represented as a
vector $V$ of $F$ entries. Different information retrieval (IR)
techniques can be used to fill an entry $V_i$ associated with
fragment $f_i$ in the library ($1 \le i \le F$). For instance, entry $V_i$
can record the presence or absence of fragment $f_i$ (stored
at position $1 \le i \le F$ in the library) in $P$, effectively resulting
in a boolean vector. Alternatively, the number of times
fragment $f_i$ is found in $P$ can be used. This is also known
as term frequency (TF) and is the method employed by



Molloy et al. BMC Bioinformatics 2014, 15(Suppl 8):S4
http://www.biomedcentral.com/1471-2105/15/S8/S4






the fragbag representation in [29]. Generally, other
naive vector space models can be used, including term
frequency-inverse document frequency (TF-IDF) [56].

The presence of a fragment $f_i$ in $P$ is detected as follows.
The Cα trace of $P$ (that is, only Cα coordinates are
extracted from the protein structure) is inspected at every
location $j$ in blocks of $l$ consecutive amino acids, or
segments $[j, j + l - 1]$. The Cα coordinates of the particular
segment under consideration are compared to each
fragment $f_i$ in the library ($1 \le i \le F$), and the fragment with
the lowest least-root-mean-squared-deviation (lRMSD) is
reported as the fragment matching the particular segment
(least in lRMSD stands for optimal RMSD after removing
deviations due to rigid-body motions, and RMSD is the
Euclidean distance weighted over the number of points)
[57]. The entire process is illustrated in Figure 1.
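
The scanning procedure just described can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: it assumes NumPy is available, it assumes the optimal superposition is computed with the standard Kabsch procedure (the method in [57] may differ in detail), and the names `lrmsd` and `fragbag` are hypothetical.

```python
import numpy as np

def lrmsd(a, b):
    """RMSD between two (l, 3) coordinate arrays after optimal rigid-body superposition."""
    a = a - a.mean(axis=0)  # remove translation
    b = b - b.mean(axis=0)
    # Singular values of the covariance matrix give the best achievable overlap (Kabsch).
    u, s, vt = np.linalg.svd(a.T @ b)
    # If the best orthogonal transform is a reflection, flip the smallest singular value.
    if np.linalg.det(u @ vt) < 0:
        s[-1] = -s[-1]
    msd = (np.sum(a * a) + np.sum(b * b) - 2.0 * s.sum()) / len(a)
    return float(np.sqrt(max(msd, 0.0)))

def fragbag(ca_trace, library):
    """Term-frequency vector: entry i counts the segments best matched by library fragment i."""
    length = library[0].shape[0]
    counts = np.zeros(len(library), dtype=int)
    for j in range(len(ca_trace) - length + 1):
        segment = ca_trace[j:j + length]  # residues j .. j + l - 1
        best = min(range(len(library)), key=lambda i: lrmsd(segment, library[i]))
        counts[best] += 1
    return counts
```

Every window of the trace contributes exactly one count, so the entries of the returned vector sum to the number of windows, that is, the trace length minus $l$ plus one.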

Given this representation, any distance or similarity 
measurements can be used over the fragbag vectors of two 
protein structures to measure their structural distance or 
similarity. In [29], various distance measurements are 
tested, including the basic Euclidean distance as well as 
cosine distance (which measures the angle between two 
vectors). The cosine distance is reported to be most 
accurate and competitive with top structure-alignment 
methods in detecting structural neighbors. 
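
As a small illustration of scoring two fragbag vectors, the sketch below computes a cosine distance. It assumes the common convention of one minus the cosine of the angle between the vectors; the exact definition used in [29] is not restated here, and the function name is hypothetical.

```python
import math

def cosine_distance(u, v):
    """One minus the cosine of the angle between two count vectors (assumed convention)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for an all-zero vector")
    return 1.0 - dot / (norm_u * norm_v)
```

Because the measure depends only on the angle, two structures whose fragment counts differ by a constant factor (for example, a domain and a twice-longer domain with the same fragment composition) are at distance zero.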

Low-dimensional embedding of protein structure space 

Given fragbag representations of protein structures, the 
newly defined (fragbag) vector space, which has dimen- 
sionality 400, can be reduced to a few dimensions 
through various dimensionality reduction techniques. In 




Figure 1 Molecular fragment replacement process. A protein
structure is shown on the left, rendered with VMD [67] using the
NewCartoon graphical representation. The protein structure is
scanned one fragment at a time from the N- to the C-terminus. The
first fragment is highlighted in red. The position of the fragment in
the fragment library is identified, and the entry in the BOW vector
at that particular position is incremented. After the entire structure
is scanned, the resulting BOW vector is the one supplied to LDA.



[47], PCA has been used to project SCOP domains on 
the two top principal components (PCs). PCA is a well- 
known linear dimensionality reduction technique, which 
finds an orthogonal transformation of points given in 
some original high-dimensional space such that the 
transformation highlights new axes, also known as the 
PCs, that maximize variance in the projected or trans- 
formed data. Typically, the transformation is said to 
yield a reduced or low-dimensional embedding when a 
few, 3-5, PCs retain more than 70% of the variance in 
the original distribution of the data [58]. We apply PCA 
here, as well, to visualize co-localization of function in 
the protein structure space and qualitatively compare 
these results with the organization readily obtained 
through the topic-based representation we investigate in 
this paper. 
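
A PCA projection of this kind can be sketched as follows. This is an illustrative implementation via the singular value decomposition, assuming NumPy; the function name `pca` and its return convention are hypothetical choices, not part of the original study.

```python
import numpy as np

def pca(X, n_components=2):
    """Project the rows of X onto the top principal components.

    Returns the projected coordinates and the fraction of total variance
    retained by each of the returned components.
    """
    Xc = X - X.mean(axis=0)                      # center each dimension
    u, s, vt = np.linalg.svd(Xc, full_matrices=False)
    explained = s ** 2 / np.sum(s ** 2)          # variance fraction per component
    coords = Xc @ vt[:n_components].T            # rows of vt are the PCs
    return coords, explained[:n_components]
```

Summing the returned variance fractions indicates whether a handful of PCs captures enough of the variance for the embedding to be called low-dimensional in the sense described above.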

LDA-based topic representation of protein structure 

We propose an alternative representation of protein 
structure in this paper based on topics obtained through 
a popular technique in text mining, LDA. LDA was 
introduced in [59] as a generative probabilistic model to 
find latent groups (topics) that capture the structure of 
observations represented by BOW models, which in this 
setting are generated using the fragbag method. The key 
idea, first introduced in [54] but limited to detection of 
structural neighbors, is to represent proteins as 
probability distributions over latent topics, which are 
themselves probability distributions over fragments in 
the fragment library. This idea builds on the original 
one introduced to categorize text documents of a given 
corpus by the topics covered in each of them. In text
mining, however, visual inspection of the words of 
highest probability in each topic allows giving semantic 
meaning to topics. Associating semantic meaning to 
protein fragments (analogous to words in this setting) 
is not easy, and we provide in this paper a series of 
analysis techniques to do so. 

We briefly describe the concepts of LDA and how they
map to our investigation of proteins. The graphical model
for LDA is shown in Figure 2. The generative process in
this model functions as follows. First, a multinomial
distribution, $\phi_t$, is assigned to each topic $1 \le t \le T$.
Each of these distributions represents the probability of
each fragment in the library participating in topic $t$. For each
protein $P_i$ that is constructed, we obtain a mixture of
topics by assigning another multinomial distribution, $\theta_i$.
Each fragment in protein $P_i$ is generated by first selecting
a topic $t$ according to $\theta_i$ and then using that topic's
distribution $\phi_t$ to select the actual fragment. Each fragment
within each protein carries a latent variable, $z_i$, that is
assigned to a specific topic. The assignment of multinomial
distributions is obtained from a Dirichlet distribution,
which is the conjugate prior for the multinomial
Figure 2 LDA plate. T is the number of topics, N is the number of
protein structures. Each fragment within a protein is represented by
f, and $n_i$ is the number of fragments in $P_i$. Blue and black
backgrounds indicate latent and observed variables, respectively.



distribution. As such, each sample from a Dirichlet yields
a multinomial distribution. Separate Dirichlet distributions
are used for sampling the distribution of topics
within a protein ($\theta_i$) and for the distribution of fragments
within a topic ($\phi_t$) and are parameterized by $\alpha$ and $\beta$,
respectively.

The goal in LDA is to maximize the likelihood of the
posterior through the refinement of the topic assignments
$z_i$. This is accomplished using the LDA algorithm from
[60]. This method initially assigns each $z_i$ to a random
topic and then utilizes many iterations of Gibbs sampling
to approximate the $\phi$ and $\theta$ distributions. We direct the
reader to [60] for a more detailed discussion of LDA and
this specific approximation algorithm.
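
To make the sampling loop concrete, the sketch below implements a textbook collapsed Gibbs sampler for LDA in plain Python. It is an assumption-laden illustration, not the MATLAB code of [60], whose details may differ: each protein is given as a list of fragment indices, the function name `lda_gibbs` and all parameter names are hypothetical, and no burn-in or convergence check is included.

```python
import random

def lda_gibbs(proteins, n_fragments, n_topics, alpha, beta, n_iter=200, seed=0):
    """Collapsed Gibbs sampling; returns (theta, phi) as nested lists."""
    rng = random.Random(seed)
    n_pt = [[0] * n_topics for _ in proteins]            # protein-by-topic counts
    n_tf = [[0] * n_fragments for _ in range(n_topics)]  # topic-by-fragment counts
    n_t = [0] * n_topics                                 # fragments assigned to each topic
    z = []
    # Start from a random topic assignment for every fragment occurrence.
    for i, frags in enumerate(proteins):
        z.append([rng.randrange(n_topics) for _ in frags])
        for f, t in zip(frags, z[i]):
            n_pt[i][t] += 1
            n_tf[t][f] += 1
            n_t[t] += 1
    for _ in range(n_iter):
        for i, frags in enumerate(proteins):
            for pos, f in enumerate(frags):
                t = z[i][pos]
                # Remove the current assignment before resampling it.
                n_pt[i][t] -= 1
                n_tf[t][f] -= 1
                n_t[t] -= 1
                weights = [
                    (n_pt[i][k] + alpha) * (n_tf[k][f] + beta) / (n_t[k] + n_fragments * beta)
                    for k in range(n_topics)
                ]
                t = rng.choices(range(n_topics), weights=weights)[0]
                z[i][pos] = t
                n_pt[i][t] += 1
                n_tf[t][f] += 1
                n_t[t] += 1
    # Smoothed estimates of the topic mixtures and the topic-fragment distributions.
    theta = [[(n_pt[i][k] + alpha) / (len(frags) + n_topics * alpha) for k in range(n_topics)]
             for i, frags in enumerate(proteins)]
    phi = [[(n_tf[k][f] + beta) / (n_t[k] + n_fragments * beta) for f in range(n_fragments)]
           for k in range(n_topics)]
    return theta, phi
```

Each row of `theta` is one protein's distribution over topics and each row of `phi` is one topic's distribution over fragments, so both sum to one by construction.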

In this context, topics make for general representations 
of proteins, under which a protein is treated as a mixture 
of many topics, albeit with different probabilities. As we 
relate in Results, one can employ these topic-based repre- 
sentations to identify structural neighbors of a protein. 
We additionally show how topics categorize the protein 
structure space, revealing interesting insight into what it 
is that each topic captures about protein structure and 
function. 

Evaluating information content in topics 

One of the parameters in LDA is the number of topics T. 
Tuning T can be accomplished by measuring the infor- 
mation gain provided in each topic compared to a base- 
line [55]. The distribution of fragments over the entire 
protein structure space, as available in SCOP, for 
instance, can be used to represent a baseline distribution 
over fragments. Each topic obtained by LDA is also a 



probability distribution over fragments. We use the symmetric
Kullback-Leibler (KL) divergence [61] to measure
the information gain of each topic over the baseline
distribution. Briefly, given two probability distributions
$p_0$ and $p_1$,

$$KL(p_0, p_1) = \sum_x p_0(x) \ln \frac{p_0(x)}{p_1(x)}.$$

We use a symmetric version of KL defined as
$0.5\,(KL(p_0, p_1) + KL(p_1, p_0))$. Larger distances imply higher
information gain in each topic as opposed to the baseline
distribution of fragments over the entire corpus. Small
distances imply that the topic is essentially junk, providing
no additional semantic content as compared to the
baseline. This evaluation is carried out for each topic in
the Results section to additionally measure the information
gain as one increases the number of topics requested
from LDA.
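
The symmetric KL distance between a topic and the baseline can be computed as in the following sketch. It assumes both distributions are given as equal-length lists of strictly positive probabilities (which holds for smoothed LDA topics but may require smoothing of an empirical baseline); the function names are hypothetical.

```python
import math

def kl(p, q):
    """KL divergence of p from q; terms with p(x) = 0 contribute nothing."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

def symmetric_kl(p, q):
    """Average of the two directed KL divergences."""
    return 0.5 * (kl(p, q) + kl(q, p))
```

A topic identical to the baseline scores zero, and the score grows as the topic concentrates probability on fragments that are rare in the baseline.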

In addition, log likelihood evaluates how well the data
(the fragments defining protein domains) fits the model,
which in this case is the topic space model produced
by LDA. When performing parameter estimation, a
common strategy is to maximize the log likelihood as
proposed in [62]. We employ this technique to measure
the effectiveness of each LDA model, varying the
number of topics $T$. Let $M$ represent all the parameters,
including $T$, for the LDA model. Equation 1 shows the
likelihood of $M$ generating the set of proteins $P$. Taking
the log of both sides yields Equation 2. Equation 3
shows the likelihood of each protein $P_i$,
and taking the log of both sides yields Equation 4. $F$ is
the total number of fragments used to describe the
ensemble and $n_v^{(i)}$ is the number of times fragment $f_v$
appears in protein $P_i$. $p(f_v \mid t_k)$ is the probability of the
fragment $f_v$ being in topic $t_k$, which is provided by the
multinomial distribution $\phi_k$. $p(t_k \mid P_i)$ is the probability of
topic $t_k$ being in protein $P_i$, which is provided by $\theta_i$.
These measurements are shown in the Results section
to demonstrate that the log likelihood decreases as the
number of topics increases.

$$p(P \mid M) = \prod_{i=1}^{N} p(P_i \mid M) \qquad (1)$$

$$\log p(P \mid M) = \sum_{i=1}^{N} \log p(P_i \mid M) \qquad (2)$$

$$p(P_i \mid M) = \prod_{v=1}^{F} \left( \sum_{k=1}^{T} p(f_v \mid t_k)\, p(t_k \mid P_i) \right)^{n_v^{(i)}} \qquad (3)$$

$$\log p(P_i \mid M) = \sum_{v=1}^{F} n_v^{(i)} \log \left( \sum_{k=1}^{T} p(f_v \mid t_k)\, p(t_k \mid P_i) \right) \qquad (4)$$
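
The log likelihood of a fitted model can be evaluated directly from the fragment counts and the two families of multinomial distributions. The sketch below is an illustration under assumed data layouts: `counts[i][v]` is the number of times fragment $v$ occurs in protein $i$, `theta[i][k]` is the probability of topic $k$ in protein $i$, and `phi[k][v]` is the probability of fragment $v$ in topic $k$. These names are hypothetical.

```python
import math

def log_likelihood(counts, theta, phi):
    """Sum over proteins of sum_v n_v * log(sum_k phi[k][v] * theta[i][k])."""
    total = 0.0
    for i, n_i in enumerate(counts):
        for v, n in enumerate(n_i):
            if n == 0:
                continue  # absent fragments contribute nothing to the sum
            mix = sum(phi[k][v] * theta[i][k] for k in range(len(phi)))
            total += n * math.log(mix)
    return total
```

Skipping zero counts avoids evaluating the logarithm for fragments that never occur in a protein, which leaves the sum unchanged.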
Topic signatures of structural classes and co-localization in protein structure space

Each topic may capture "signatures" associated with 
different classifications (SCOP, CATH). To test for 
these signatures, we propose using heatmaps 
constructed over the LDA-computed topic space. LDA
presents the topic space as an $N \times T$ matrix, where $N$ is
the number of proteins and $T$ is the number of topics.
The row vector for protein $P_i$ records the number of
times a fragment is classified to be within a given topic
$T_j$. Additionally, each protein is assigned a label according
to some classification standard; a label corresponds
to a class. For instance, a label may be the class of the
protein, as obtained from the top level of the SCOP
hierarchy. Alternatively, the label can track the superfamily
membership of a protein in SCOP.

Many protein domains are assigned the same label $L_i$.
We sum fragment counts for topic $T_j$ on each protein
assigned the same label $L_i$. This provides us with a fragment
count for topic $T_j$ in label $L_i$. Normalizing over all
labels provides us with probability $P(L_i \mid T_j)$. This produces
an $L \times T$ matrix, where $L$ is the number of labels and each
column in the matrix sums to
one. Results in this paper visualize this matrix as a heatmap,
with colors following the low-to-high probabilities in
a blue-to-red color scheme.
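
The aggregation and normalization behind the heatmap can be sketched as follows. The layout is an assumption made for illustration: `topic_counts[i][j]` is the number of fragments of protein $i$ assigned to topic $j$, `labels[i]` is that protein's label, and the function name is hypothetical.

```python
def label_topic_matrix(topic_counts, labels):
    """Return (label names, matrix) where matrix[a][j] estimates P(label a | topic j)."""
    names = sorted(set(labels))
    n_topics = len(topic_counts[0])
    sums = {name: [0.0] * n_topics for name in names}
    # Pool fragment counts over all proteins that share a label.
    for row, label in zip(topic_counts, labels):
        for j, count in enumerate(row):
            sums[label][j] += count
    # Normalize each topic column over the labels.
    totals = [sum(sums[name][j] for name in names) for j in range(n_topics)]
    matrix = [[sums[name][j] / totals[j] if totals[j] else 0.0 for j in range(n_topics)]
              for name in names]
    return names, matrix
```

Each non-empty column of the returned matrix sums to one, matching the column normalization described above.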

When protein classes have strikingly different sizes, the
above analysis will be skewed. A high probability $P(L_i \mid T_j)$
may be assigned to a class with label $L_i$ simply because of
the high number of domains in the class with label $L_i$.
This situation arises when analyzing topic signatures over
the superfamily classification in SCOP. In this case, we
take a different approach to obtaining a heatmap that
elucidates topic signatures for protein classes. We employ
the ChiSquare significance test [63] at a confidence level
of 99%. This analysis is performed for each topic $T_j$. For
the proteins with label $L_i$, we compute the number of
fragments found within topic $T_j$ (let us refer to this as $C_{L_i}^{T_j}$),
and the number of fragments that are not assigned to
proteins with this label ($C_{\neg L_i}^{T_j}$). We compute these counts
for the entire population minus the topic we are currently
analyzing ($C_{L_i}^{\neg T_j}$ and $C_{\neg L_i}^{\neg T_j}$). These values are used to
construct a contingency table and perform the ChiSquare
significance test. When the test shows a significant difference,
and the population in the topic is greater than the
remainder of the population, we characterize this topic as
having a signature for the label under consideration.
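
One way to carry out such a test on the four counts is sketched below. Several choices here are assumptions rather than details given in the text: the Pearson statistic is computed without a continuity correction, the condition that the population in the topic is greater than in the remainder is interpreted as the label's share of fragments being higher inside the topic than outside it, and the function and argument names are hypothetical. The critical value 6.635 is the chi-square threshold for one degree of freedom at the 99% level.

```python
CHI2_CRITICAL_DF1_99 = 6.635  # chi-square critical value, 1 degree of freedom, 99% level

def has_signature(label_in_topic, other_in_topic, label_in_rest, other_in_rest):
    """True when the label is significantly over-represented in the topic."""
    table = [[label_in_topic, other_in_topic],   # row 0: fragments in the topic
             [label_in_rest, other_in_rest]]     # row 1: fragments in all other topics
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][c] + table[1][c] for c in range(2)]
    if 0 in row_totals or 0 in col_totals:
        return False  # the test is undefined for a degenerate table
    stat = 0.0
    for r in range(2):
        for c in range(2):
            expected = row_totals[r] * col_totals[c] / n
            stat += (table[r][c] - expected) ** 2 / expected
    over_represented = label_in_topic / row_totals[0] > label_in_rest / row_totals[1]
    return stat > CHI2_CRITICAL_DF1_99 and over_represented
```

Running this for every topic and label pair yields a binary matrix that can be displayed as a heatmap in place of the raw probabilities.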

Predicting superfamily membership of protein structure 

We demonstrate that the fragbag and topic-based repre- 
sentations can be employed by machine learning classifica- 
tion algorithms to predict superfamily membership for a 
given protein structure. Since this is a multiclass classifica- 
tion problem, we employ the one-vs-all strategy, using 



7 binary classifiers, one for each of the 7 most-populated 
superfamilies in SCOP. We employ the popular Support 
Vector Machines (SVM) for the binary classifier [64]. 

The set of 9,852 protein domains in these superfamilies
is extracted, and LDA is applied to this set. When using
the topic-based representation, each protein's multinomial
distribution across the topic space returned by LDA serves
as its coordinates in the 10-dimensional space (our analysis
in the Results section makes the case that no more
than 10 topics are needed). The resulting 10-dimensional
vectors are treated as a training dataset, and 7 binary SVM
classifiers are built, one per superfamily, to predict
superfamily membership. When
using the fragbag representation, the training vectors are
400-dimensional as opposed to topic vectors which are
10-dimensional.

When building an SVM classifier for superfamily $i$ ($1 \le i \le 7$),
the set of training vectors corresponding to domains
in that superfamily are treated as the positive training 
dataset. The rest of the vectors, corresponding to domains 
in other superfamilies are treated as the negative training 
dataset. We note that for some of the superfamilies, there 
are many more negative instances than positive ones, as 
expected. In such cases, re-balancing of data is performed 
by undersampling the negative class in order to achieve an 
equal count of positive and negative instances. 
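
The re-balancing step can be illustrated with a short sketch. It assumes random undersampling without replacement, which the text does not specify, and the function name and seed parameter are hypothetical.

```python
import random

def undersample(positives, negatives, seed=0):
    """Randomly drop negative instances until both classes have equal size."""
    rng = random.Random(seed)
    if len(negatives) > len(positives):
        negatives = rng.sample(negatives, len(positives))  # sampling without replacement
    return positives, negatives
```

When the negative class is already no larger than the positive class, both lists are returned unchanged.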

Each SVM classifier is trained independently (on each 
superfamily), using a polynomial kernel and a soft margin 
parameter C = 1.0. Ten-fold cross-validation is used to 
measure the classification performance, as related in the 
Results section. For each protein domain, the prediction 
among the 7 classifiers that has the highest confidence is 
chosen as the final prediction for that domain. In this 
way, superfamily membership is predicted for each domain,
and standard TPR, FPR, and accuracy measurements can 
be used to evaluate performance. 
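
The final decision rule and the per-superfamily measurements can be sketched as follows. The code assumes that each of the 7 classifiers outputs a numeric confidence for its own superfamily and that superfamilies are identified by integer indices; both function names are hypothetical.

```python
def predict_superfamily(confidences):
    """Index of the binary classifier that reports the highest confidence."""
    return max(range(len(confidences)), key=lambda i: confidences[i])

def one_vs_all_rates(true_labels, predicted_labels, target):
    """TPR, FPR, and accuracy when `target` is treated as the positive class."""
    tp = fp = tn = fn = 0
    for truth, guess in zip(true_labels, predicted_labels):
        if truth == target and guess == target:
            tp += 1
        elif truth == target:
            fn += 1
        elif guess == target:
            fp += 1
        else:
            tn += 1
    total = tp + fp + tn + fn
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return tpr, fpr, accuracy
```

Calling `one_vs_all_rates` once per superfamily gives the per-class figures; ties in `predict_superfamily` are broken in favor of the lowest index, which is an arbitrary choice of this sketch.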

Results and discussion 

Implementation details, datasets, and experimental setup 

We use a MATLAB implementation for LDA [60]. All
our experiments and analysis are executed on a 2.4 GHz
Core i7 processor. Parameter values for LDA are $\alpha$ = 50/
(number of topics) and $\beta$ = 200/(fragment library size).
Extracting the fragbag representation for each protein
domain in a dataset of 31,155 domains (datasets are
detailed below) takes 10 hours. LDA runtimes depend on
the number of topics requested and vary from 2 hours
for 10 topics to 24 hours for 200 topics. The WEKA data
mining package [65] is employed for training SVMs on
superfamily-labeled protein structures as described in the
Methods section. The analysis is organized in four sets of
experiments.

We first tune LDA varying the number of topics to 
show that most information can be obtained with a 
relatively small number of topics. The topics that allow
obtaining comparable results in this context are then 
analyzed in detail in terms of what fragments they 
capture. This allows associating "semantic" meaning to 
topics in terms of the over-represented fragments they 
contain. 

Second, we demonstrate that the representation of a 
protein domain through LDA-obtained topics, as 
described in Methods, is just as useful as the fragbag 
representation to capture structural similarity and report 
structural neighbors with comparable accuracy. We do so 
over a database of 2,930 sequence-nonredundant struc- 
tures, extracted from CATH, as in [29,54]. Each structure 
in this dataset is treated as a query, and structural neigh- 
bors are identified for it over the rest of the dataset. This 
process is repeated for each structure in the dataset to 
obtain the average area under the curve of receiver operat- 
ing characteristic (ROC) curves [66]. We place these 
results in context, comparing to representative structure 
alignment and filter methods. 

Our third set of experiments concerns how topics can 
be used to organize protein structure space as compared 
to the fragbag representation. This analysis is placed in 
context by first demonstrating the usefulness of the frag- 
bag representation in obtaining a low-dimensional map of 
the protein structure space through PCA. We restrict our 
analysis and visualization to two levels in the SCOP hierar- 
chy, class and superfamily. The dataset we employ to 
demonstrate the co-localization of structurally- and func- 
tionally-similar proteins (according to classes in a SCOP 
hierarchy) consists of 31,155 protein domains extracted 
from SCOP 1.71 [49]. This dataset was kindly provided to us 
by R. Kolodny, and we chose it so that 
direct comparisons can be drawn with the work by Kolodny 
and colleagues in [47]. For clarity, we focus the analysis on the 
top-populated groups in the two chosen levels, class and 
superfamily, in the SCOP hierarchy. We show 
that classes have unique topic signatures, which further 
supports our conclusions that LDA-obtained topics are 
general and informative representations of protein 
domains. They can be employed to detect remote homo- 
logs and obtain further insight about the organization of 
the protein structure space. 

Our fourth and final set of experiments demonstrates 
that the topic-based representation captures important 
information about a protein structure that allows predict- 
ing superfamily membership. Binary classifiers are used for 
this purpose to predict one of the 7 most-populated super- 
families for given protein structures. Our results show that 
both representations allow standard classifiers to achieve 
high prediction accuracy, which we believe opens the way 
towards using simple representations for automated and 
reliable hierarchic classification of proteins in databases 
such as SCOP and CATH. 



Less is more: topic space is low-dimensional 

We show that increasing the number of topics results in 
topics of low information gain, demonstrating that the 
chosen number of 10 topics is appropriate. We compute 
the symmetric KL distance, as described in Methods, to 
measure the information gain of each topic over the base- 
line distribution of fragments over all SCOP domains. We 
do so for 11 different settings of T, starting with T = 10 
through T = 200. Figure 3 highlights the value of the KL 
distances for three settings of T (10, 100, and 200). To formulate 
a quantitative comparison, we compute the mean and variance 
of each set of KL distances for each of the 11 settings 
of T, which are shown in the bottom right panel of Figure 3. 
This analysis illustrates that the mean KL distance 
decreases and the variance increases as the number of topics 
increases. This suggests that increasing the number of topics does not 
yield more information and that many topics are essentially 
"junk" topics for the larger values of T [55]. 

Additionally, we show the log likelihood, measured as 
detailed in the Methods section, for various settings of T 
in Figure 4. As the number of topics increases, the log 
likelihood decreases. Combined with the analysis of 
information gain, this indicates that more topics are 
not necessarily better. Moreover, these results support the 
choice of 10 topics as sufficient for the rest of our analysis. 
It is worth emphasizing that, from now on, a protein 
structure is represented as a 10-dimensional vector (where 
each entry in the vector records the probability with which 
that topic is "found" in the structure). This lies in contrast 
to the higher-dimensional vector space resulting from the 
fragbag representation where 400 fragments are employed 
as opposed to 10 topics. One of the advantages of this 
lower dimensionality is that dimensionality reduction 
techniques do not have to be used in order to provide 
low-dimensional user-friendly embeddings or maps of 
protein structure space. A component of our analysis 
below illustrates how topics are signatures of SCOP classes 
and can even be employed to accurately predict superfam- 
ily membership. 
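
The information-gain analysis in this section can be illustrated with a minimal sketch. It assumes that the symmetric KL distance is the sum KL(p||q) + KL(q||p); the exact definition used in Methods may differ (for example, by a factor of one half). The function names, the smoothing constant `eps`, and the toy distributions over four fragment types are hypothetical and serve only as illustration.

```python
import math

def symmetric_kl(p, q, eps=1e-12):
    """Symmetrized KL divergence, KL(p||q) + KL(q||p), for discrete distributions.

    Assumption: the symmetrization is a plain sum. eps is a hypothetical
    smoothing constant that avoids log(0); vectors are renormalized afterwards.
    """
    p = [x + eps for x in p]
    q = [x + eps for x in q]
    sp, sq = sum(p), sum(q)
    p = [x / sp for x in p]
    q = [x / sq for x in q]
    kl_pq = sum(a * math.log(a / b) for a, b in zip(p, q))
    kl_qp = sum(b * math.log(b / a) for a, b in zip(p, q))
    return kl_pq + kl_qp

def mean_and_variance(values):
    """Mean and population variance of a list of numbers."""
    m = sum(values) / len(values)
    v = sum((x - m) ** 2 for x in values) / len(values)
    return m, v

# Toy data (not from the paper): a baseline over 4 fragment types and 2 topics.
baseline = [0.25, 0.25, 0.25, 0.25]
topics = [
    [0.70, 0.10, 0.10, 0.10],  # peaked topic: far from the baseline
    [0.25, 0.25, 0.25, 0.25],  # identical to the baseline: a "junk" topic
]

distances = [symmetric_kl(t, baseline) for t in topics]
mean_kl, var_kl = mean_and_variance(distances)
```

Under these assumptions, a topic that merely reproduces the baseline fragment distribution has a distance of zero, so adding such topics lowers the mean KL distance of the set.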

Before reporting how the topic-based representation 
compares to fragbag and other methods in 
detecting remote homologs and organizing protein structure 
space, we provide further insight into what the topics 
capture. In text mining, inspecting the top-populated 
word(s) of a topic readily suggests what the 
topic captures. It is not possible to do so directly in the 
protein structure space. However, inspecting the top fragment(s) 
(for lack of space, we limit the visualization to 
only the top fragment) and correlating this information 
with the classes most likely to be associated with 
certain topics provides insight into the meaning of a 
topic in the protein structure space. The top-populated 
fragments in each topic are shown in Figure 5. 
Figure 3 LDA topic information analysis. Symmetric KL distance between each topic and the baseline fragment distribution over the entire 
corpus is shown. Three settings of LDA are compared, where the number of topics varies from 10, to 100, to 200. A quantitative comparison is 
shown where the number of topics is evaluated at 11 different values. 



Detection of close and remote homologs: topics capture 
structural similarity 

We first compare the ability of the topic-based represen- 
tation vs. fragbag to identify structural neighbors of a 
protein. We recall that the dataset employed for this 
analysis is the sequence-nonredundant dataset of 2,930 
protein structures extracted from CATH. Each protein 
in this dataset is treated as a query. The gold standard 
that determines which proteins in the dataset are 
structural neighbors of a query protein is obtained by a 
best-of-six structural alignment protocol, courtesy of R. 
Kolodny. Three different structural alignment score 
(SAS) thresholds of 5, 3.5, and 2.0 Å are employed. A SAS threshold 
of 2.0 Å allows identifying close homologs of a protein, 
whereas a threshold of 5 Å identifies remote homologs. 
Given a particular SAS threshold and the gold standard 
of structural neighbors obtained with that threshold, the 
following experiment is conducted. 

Figure 4 Topic analysis using log likelihood. The log likelihood 
of fitting the data is shown for 11 LDA models, where the number 
of topics varies from 10 to 200. 

Employing the fragbag or topic-based representation 
and the cosine distance over the particular representation 
under investigation, and continuously varying the 
decision threshold (that is, the threshold on the cosine distance between 
two protein structures under the particular representation), 
a receiver operating characteristic (ROC) curve can be constructed, 
and the area under the curve (AUC) 
can be reported. The ROC curve plots the true 
positive rate (TPR = TP/(TP+FN)) vs. the false positive 
rate (FPR = FP/(FP+TN)) over the decision threshold. 
Summarizing the ROC curve with the AUC allows associating a 
score with each query protein. Averaging over all proteins 
in the dataset, essentially treating each of them in 
turn as a query protein, yields an average 
AUC and thus measures the effectiveness of a particular 
representation at capturing structural neighbors. Performing 
this analysis at the three different SAS 
thresholds further allows judging the effectiveness at 
capturing close to remote homologs. 
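
The per-query evaluation described above can be sketched as follows. This is a minimal illustration rather than the code used in this work: it computes the AUC through the equivalent rank (Mann-Whitney) formulation instead of explicitly sweeping the decision threshold, assumes non-zero representation vectors, and skips queries that have no neighbors or no non-neighbors. The function names and the toy vectors and neighbor sets are hypothetical.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def auc_from_distances(distances, is_neighbor):
    """AUC as the probability that a true neighbor is closer than a non-neighbor.

    Ties count as 0.5. Returns None when one of the two groups is empty.
    """
    pos = [d for d, y in zip(distances, is_neighbor) if y]
    neg = [d for d, y in zip(distances, is_neighbor) if not y]
    if not pos or not neg:
        return None
    wins = 0.0
    for dp in pos:
        for dn in neg:
            if dp < dn:
                wins += 1.0
            elif dp == dn:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def average_auc(vectors, neighbors):
    """Treat every structure as a query against the rest and average the AUCs.

    vectors[i] is the representation of structure i; neighbors[i] is the set
    of indices that the gold standard marks as structural neighbors of i.
    """
    aucs = []
    for i, query in enumerate(vectors):
        others = [j for j in range(len(vectors)) if j != i]
        dists = [cosine_distance(query, vectors[j]) for j in others]
        labels = [j in neighbors[i] for j in others]
        auc = auc_from_distances(dists, labels)
        if auc is not None:  # assumption: undefined queries are skipped
            aucs.append(auc)
    return sum(aucs) / len(aucs)

# Toy data (not from the paper): two pairs of mutually similar structures.
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_neighbors = {0: {1}, 1: {0}, 2: {3}, 3: {2}}
toy_average = average_auc(toy_vectors, toy_neighbors)
```

On the toy data every true neighbor is closer to its query than any non-neighbor, so each per-query AUC, and hence the average, equals 1.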

Figure 6 compares the average AUCs obtained using 
fragbag and our topic-based representations and addition- 
ally places them in a larger context by comparing them 
to two methods, SSM [28], representative of alignment- 
based methods, and SGM, representative of filter meth- 
ods [42]. The average AUCs reported for these methods 
are obtained as published in [14]. Additionally, average 
AUCs obtained over topics as reported in [54] with 10 
topics are shown. Figure 6 shows that SSM is the best 
performer, followed closely by fragbag and the rest. LDA 
and SGM are comparable. 

In particular, the average AUCs at each SAS threshold 
obtained with the fragbag and topic-based representations 
are listed in Table 1 for a direct comparison. Two observations 
can be drawn. First, both representations, fragbag 
and topic-based, are equally effective at capturing structural 
neighbors at each of the three SAS thresholds. Second, 
under each representation, the effectiveness is 
higher at lower SAS thresholds (above 0.8 at a SAS 
threshold of 2.0 Å), allowing us to conclude that the 
representations capture close homologs more easily 
than remote homologs. However, performance 
on remote homologs remains good (higher than 0.7 at a 
SAS threshold of 5 Å). Taken together, this experiment 
shows that the topic-based representation 
captures structural similarity and can be 
employed to rapidly extract structural neighbors (close 
and remote homologs) of a given protein with known 
structure. 

Automated mapping and organization of protein 
structure space 

We now proceed to demonstrate how the fragbag and 
topic-based representations can be used to provide low- 
dimensional maps or categorizations of the known protein 
structure space. 

Analysis of fragment-based embeddings of protein structure 
space 

We conduct PCA on the SCOP dataset 
described above. The cumulative variance over the 
ordered eigenvalues, plotted in Figure 7 (top panel), 
shows that the first two PCs capture more than 99% of 
the variance, demonstrating that projection onto these two 
PCs provides an informative low-dimensional map of 
the protein structure space. We visualize such a map in 
Figure 7 (middle and bottom panels). We employ different 
color-coding schemes to track proteins that belong to 
the same class or the same superfamily in SCOP. 
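
As an illustration of the cumulative-variance analysis above, the following sketch runs PCA through a singular value decomposition with NumPy. The data are synthetic (a two-dimensional signal embedded in 20 dimensions plus small noise) and stand in for a matrix whose rows would be fragbag vectors; the function name `pca_project` and all sizes are hypothetical.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Return the projection onto the top PCs and the cumulative variance ratio."""
    Xc = X - X.mean(axis=0)                      # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    variances = S ** 2 / (X.shape[0] - 1)        # eigenvalues of the covariance
    cumulative = np.cumsum(variances) / variances.sum()
    coords = Xc @ Vt[:n_components].T            # coordinates on the top PCs
    return coords, cumulative

# Synthetic stand-in data (not SCOP): 200 samples, 20 features, rank-2 signal.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2)) * np.array([5.0, 2.0])
mixing = rng.normal(size=(2, 20))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 20))

coords, cumulative = pca_project(X)
```

Here `cumulative[1]` is the fraction of variance captured by the first two PCs, the quantity read off the top panel of a plot such as Figure 7, and `coords` holds the two-dimensional map that would be color-coded by class or superfamily.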

Figure 7 (middle panel) shows the highest-populated 
classes in the first level of the hierarchy; these are, 
namely, α, β, α + β, and α/β proteins. The PCA map in 
Figure 7 (middle panel) clearly shows that the first PC 
captures most of the all-α proteins, whereas the second 
PC captures most of the all-β proteins. There is more 
variation in the proteins assigned to the all-β class, but a 
closer inspection reveals that some of these proteins contain 
one or a few α-helices (data not shown). As expected, 
the other two classes, which combine α-helices and 
β-sheets, span the space. The layout of protein classes in 
this low-dimensional map is in agreement with other 
studies [47,34]. 

Figure 5 Top-populated fragment. The top-populated fragment of each topic (1 through 10) is shown. 

Figure 6 Average AUCs over SCOP. The average AUCs over the 
SCOP dataset, calculated as described in the Results section, are 
compared among different methods. Data from the SGM and SSM 
methods are obtained as published in [14]. These two methods are 
compared against the fragbag and two topic-based representations 
(as published here and in [54]). 

Figure 7 (bottom panel) selects six top-populated SCOP 
superfamilies. Proteins in a superfamily have similar function. 
In agreement with the study in [34], which pursues a 
multidimensional scaling (MDS) mapping of the protein 
structure space (employing a different parameterization), 
the two-dimensional map obtained from PCA 
shows good functional co-localization of these superfamilies. 
That is, proteins in the same superfamily are also 
neighbors in the projected space. This result further illustrates 
the usefulness of low-dimensional maps that allow 
visualization of the protein structure space. 

It is interesting to note that the fragbag representation 
essentially unravels the non-linearity in the protein 
structure space. In other studies, most notably by Kim 
and colleagues [34], MDS has been central to obtaining 
an accurate low-dimensional projection of the structure 
space. The parameterization of a protein structure in 
that study was not based on a BOW representation. 

Figure 7 PCA analysis of SCOP domains. Top panel shows 
accumulation of variance from PCA. The top two PCs capture more 
than 99% of variance. Middle and bottom panels show the projection 
of SCOP domains on the top two PCs. Different colors are used to 
separate classes (middle panel) and superfamilies (bottom panel). 
Topics have semantic meaning in the protein structure 
space 

Taken together, the above analysis suggests that topic 
space is an informative low-dimensional embedding of the 
protein structure space that captures structural 
similarity. To complete the analysis, we elucidate topic signatures 
per SCOP class at different levels of the SCOP 
hierarchy. The heatmap shown in Figure 8 color-codes 
topics per class at the fold level of the SCOP hierarchy in 
a blue-to-red color scheme tracking low-to-high probabilities 
measured as detailed in Methods. The results shown 
in Figure 8 suggest that topics 1-4 are over-represented in 
the α class but under-represented in the β class. This is 
reversed for topics 5-10. In contrast, the other classes 
either have a high mixture or a low mixture of each topic. 
Correlating these results with those shown in Figure 5 
provides an explanation for why this is the case. Topics 1-4 
are related to α-helical topologies, as evidenced by the 
top fragment shown. Topics 5-10 are related instead to β-sheet 
topologies. Put together, these results demonstrate 
that classes at the fold level of the SCOP hierarchy have 
unique topic signatures. It is worth emphasizing that this 
result is made even stronger when considering that, often, 
domains assigned to the β class may contain a few α-helices 
(data not shown). The analysis suggests that topics 
capture structural categorization. 
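
One simple way to obtain per-class topic signatures of the kind discussed above is to average the topic-probability vectors of all domains that share a label. The sketch below does only that; it is an assumption for illustration and not necessarily the procedure detailed in Methods. The function name, the three-topic vectors, and the labels are hypothetical.

```python
def class_topic_signatures(topic_vectors, labels):
    """Average the topic-probability vectors of the domains in each class.

    Assumption: a class signature is the plain mean of its members' vectors.
    """
    sums = {}
    counts = {}
    for vec, lab in zip(topic_vectors, labels):
        if lab not in sums:
            sums[lab] = [0.0] * len(vec)
            counts[lab] = 0
        for k, p in enumerate(vec):
            sums[lab][k] += p
        counts[lab] += 1
    return {lab: [s / counts[lab] for s in sums[lab]] for lab in sums}

def dominant_topic(signature):
    """Index (0-based) of the topic with the highest average probability."""
    return max(range(len(signature)), key=lambda k: signature[k])

# Toy data (not from the paper): three topics, two classes.
toy_vectors = [
    [0.8, 0.1, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.2, 0.7],
]
toy_labels = ["alpha", "alpha", "beta"]

signatures = class_topic_signatures(toy_vectors, toy_labels)
```

Each row of a heatmap such as the one in Figure 8 would then correspond to one entry of `signatures`, with the color of a cell tracking the averaged probability of a topic in that class.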

The heatmap in Figure 9 is prepared through the technique 
detailed in Methods to correct for the high variance in 
population sizes of top superfamilies in SCOP. Blue indicates 
low presence of a topic, and red indicates high presence. 
The results shown in Figure 9 suggest that 
superfamilies have unique topic signatures. For instance, 
the immunoglobulin domain has many of topics 5-10 
over-represented. This is encouraging, as inspection of 
these topics in Figure 5 reveals that they are high in 
β-sheets, and immunoglobulin domains are all-β proteins. 
On the other hand, the P-loop Binding domain is rich 
in α-helices. Encouragingly, the topics that are over-represented 
in this superfamily are topics 1-4, which capture 
α-helical fragments, as shown in Figure 9. The winged 
helix DNA-binding domain is significantly represented in 
topics 1 and 3, both having a high concentration of α-helical 
fragments. This agrees with the SCOP classification of this 
domain as all-α. Similarly, EF-hand is only significantly 
represented in topic 1, which is dominated by α-helical 
fragments. This is in agreement with the all-α SCOP classification. 
The topic signatures capture the other superfamilies 
as well, suggesting that topics additionally capture 
functional categorization. 

Figure 8 Heatmap topic analysis on SCOP folds. Heatmap 
highlights "signature" topics per class in the fold level of the SCOP 
hierarchy. Blue-to-red color scheme tracks low-to-high probabilities 
measured as detailed in Methods. 

Figure 9 Heatmap topic analysis on SCOP classes. Heatmap 
highlights "signature" topics per class in the superfamily level of the 
SCOP hierarchy. Blue-to-red color scheme tracks low-to-high 
probabilities measured as detailed in Methods. 

Predicting superfamily membership 

Finally, a set of 7 classifiers is built as described in the 
Methods section. This experiment is repeated twice, 
once using the fragbag and once using the topic-based 
representation. The distribution of the protein 
domains employed as training data in each case across 
the 7 superfamilies is shown in Figure 10. The performance 
of each of the 7 SVM classifiers in 10-fold cross-validation 
is shown in Table 2. High accuracy (> 80%), 
TPR (≥ 0.79), AUC (≥ 0.80), and low FPR (≤ 0.32) are 
obtained on each superfamily whether using fragbag or 
the topic-based representation. The fragbag representation 
allows for better classification performance, with the 
largest gaps in accuracy on the P-Loop Binding, Thioredoxin-like, 
and winged helix DNA-binding superfamilies. 
These results confirm that the topic-based representation, 
while only 10-dimensional as compared to the 
400-dimensional fragbag representation, can be used to 
build effective classifiers of proteins, even at the superfamily 
level of detail. 
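
For reference, the per-classifier quantities reported for this experiment follow directly from the confusion counts of a binary classifier, using the definitions TPR = TP/(TP+FN), FPR = FP/(FP+TN), and accuracy = (TP+TN)/(number of samples). The sketch below is a minimal illustration; the function name and the toy labels are hypothetical, and it assumes that both classes are present among the true labels.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, TPR, and FPR from paired true and predicted binary labels.

    Assumes at least one positive and one negative sample in y_true.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "tpr": tp / (tp + fn),
        "fpr": fp / (fp + tn),
    }

# Toy labels (not from the paper): 1 marks membership in the superfamily.
toy_true = [1, 1, 1, 1, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 0, 0, 0, 1]
toy_metrics = binary_metrics(toy_true, toy_pred)
```

In a k-fold cross-validation setting, these quantities would be computed on each held-out fold and then averaged.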



Molloy et al. BMC Bioinformatics 2014, 15(Suppl 8):S4 
http://www.biomedcentral.com/1471-2105/15/S8/S4 






Figure 10 SCOP superfamily distribution. The distribution per 
superfamily is shown for the protein domains in the 7 most-populated 
superfamilies in SCOP (P-Loop Binding, Immunoglobulin, 
NAD(P)-binding Rossmann, Thioredoxin-like, alpha/beta-Hydrolases, 
EF-Hand, and Winged Helix DNA-Binding; the horizontal axis lists the 
SCOP superfamilies, and the vertical axis ticks range from 0 to 7000). 
These domains are treated as 
training data for SVMs to classify proteins by superfamily. 



Conclusions 

In this work we have investigated a novel low-dimen- 
sional categorization of protein structure space combin- 
ing mature and popular tools in text mining with work 
in structural bioinformatics. The LDA-obtained topic 
representation of protein structure is analyzed in detail 
for its ability to summarize a protein structure with 
multinomial distributions. Our investigation reveals that 
indeed meaningful topics can be discovered in protein 
structures, and that these topics can in turn be used to 
reveal similar protein structures and organize protein 
structure space. 

In particular, results presented in this work suggest that 
topic-based categorization of protein structures preserves 
structural and functional co-localization. Specifically, 
topics obtained through LDA are shown to capture structural 
similarity with sufficient accuracy on both close 
and remote homologs and additionally yield a low-dimensional 
organization of the protein structure space 
that preserves groupings by structure and function. 
Topics are also shown to provide sufficient discriminative 
power to standard supervised learning classifiers like 
SVMs for predicting superfamily membership. Taken 
together, the results suggest that the LDA-obtained topic 
representation of protein structure can be used to aid 
classification in structural databases. 

Table 2 SCOP SVM Classification Results. 

                            Fragbag Representation          Topic-Based Representation 
SCOP Superfamily            Acc. (%)  TPR   FPR   AUC       Acc. (%)  TPR   FPR   AUC 
P-Loop Binding              96.4      0.98  0.05  0.95      84.3      0.97  0.29  0.84 
Immunoglobulin              100.0     1.00  0.00  1.000     99.9      0.99  0.0   1.0 
NAD(P)-binding Rossmann     98.7      0.99  0.02  0.99      90.9      0.94  0.13  0.91 
Thioredoxin-like            98.8      0.98  0.01  0.99      80.2      0.92  0.32  0.80 
alpha/beta Hydrolases       99.1      1.00  0.02  0.99      92.7      0.95  0.10  0.93 
EF-hand                     100.0     1.00  0.00  1.000     98.8      0.99  0.01  0.99 
Winged helix DNA-binding    98.7      0.98  0.01  0.99      84.4      0.79  0.11  0.84 

Performance is reported for the 7 SVM classifiers identifying a protein domain 
as being a member of one of the seven SCOP superfamilies. Accuracy (Acc.) is 
the sum of true positives and true negatives divided by the number of 
samples. Reported values are rounded to the second decimal place. 

The work presented in this paper opens exciting new 
avenues for extracting and organizing information about 
protein structures and protein structure space through 
mature tools in text mining. We additionally hope that 
this work can inspire further investigation of higher-order 
representations of protein structures, both for 
structure comparison and for investigating the relationship 
between protein sequence, structure, and function. 
Specifically, future work may further mine 
and refine the topic-based representation in a way that 
provides visually-friendly categorizations of protein 
structure to potentially assist hierarchic organizations in 
current structural databases, such as SCOP and CATH. 
Additional future work can explore the use of LDA 
over structure components other than backbone 
fragments. 